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Method and Apparatus for Using Conditional Selectivity as Foundation for 
Exploiting Statistics on Query Expressions 

Technical Field 

The invention relates generally to the field of relational databases and specifically 
to the field of optimizing queries on databases. 

Background of the Invention 

Most query optimizers for relational database management systems (RDBMS) 
rely on axost model to choose the best possible query execution plan for a given query; 
Thus, the quality of the query execution plan depends on the accuracy of cost estimates. 
Cost estimates, in turn, crucially depend on cardinality estimations of various sub-plans 
(intermediate results) generated during optimization. Traditionally, query optimizers use 
statistics built over base tables for cardinality estimates, and assume independence while 
propagating these base-table statistics through the query plans. However, it is widely 
recognized that such cardinality estimates can be off by orders of magnitude. Therefore, 
the traditional propagation of statistics can lead the query optimizer to choose 
significantly low-quality execution plans. 

Summary of the Invention 

Using conditional selectivity as a framework for manipulating query plans to 

leverage statistical information on intermediate query results can result in more efficient 

query plans. A number of tuples returned by a database query having a set of predicates 
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that each reference a set of database tables can be approximated. The query is 
decomposed to form a product of partial conditional selectivity expressions. The partial 
conditional selectivity expressions are then matched with stored statistics on query 
expressions to obtain estimated partial conditional selectivity values. The selectivity of 
the query is then estimated by combining the obtained partial conditional selectivity 
results. The resulting query selectivity estimate can be multiplied by a Cartesian product 
of the tables referenced in the query to arrive at a cardinality .value. 

The decomposition of the query can be performed recursively by repeatedly 
separating conditional selectivity expressions into atomic decompositions. During 
matching an error can be associated with a selectivity estimation that is generated using a 
given statistic and those statistics with the lowest error may be selected to generate the 
query selectivity estimation. The error may be based on the difference between a statistic 
that is generated by an intermediate query result and a statistic on the corresponding base 
table. Statistics on query expressions that correspond to a subset of the predicates 
represented in a given selectivity expression may be considered for estimating the 
selectivity of the given selectivity expression. In an optimizer environment, the 
decomposition may be guided by the sub-plans generated by the optimizer. A wider 
variety of queries can be decomposed by transforming disjunctive query predicates into 
conjunctive query predicates. 

Brief Description of the Drawings 

The present invention is illustrated by way of example and not limitation in the 
figures of the accompanying drawings, in which: 



Figure 1 illustrates an exemplary operating environment for a system for 
evaluating database queries using statistics maintained on intermediate query results; 

Figure 2 is a block diagram of a prior art optimizer that can be used in conjunction 
with the present invention; 

Figure 3 is tree diagram for a query and two alternative execution sub-plans for a 
prior art optimizer; 

Figure 4 is a block diagram for a method for evaluating database queries using . 
statistics maintained on intermediate query results according to an embodiment of the 
present invention; 

"Figure 5 is block diagram of a memo table for an optimizer that implements the 
method of Figure 4; and 

Figure 6 is a tree diagram that illustrates, a coalescing grouping transformation of 
query that can be used in practice of an embodiment of the present invention; and 

, Figure 7 is a tree diagram that illustrates an invariant grouping transformation of a 
query that can be used in practice of an embodiment of the present invention. 

Detailed Description of the Preferred Embodiment 
Exemplary Operating Environment 

Figure 1 and the following discussion are intended to provide a brief, general 
description of a suitable computing environment in which the invention may be 
implemented. Although not required, the invention will be described in the general 
context of computer-executable instructions, such as program modules, being executed 
by a personal computer. Generally, program modules include routines, programs, 



objects, components, data structures, etc., that perform particular tasks or implement 
particular abstract data types. Moreover, those skilled in the art will appreciate that the 
invention may be practiced with other computer system configurations, including hand- 
held devices, multiprocessor systems, microprocessor-based or programmable consumer 
electronics, network PCs, minicomputers, mainframe computers, and the like. The 
invention may also be practiced in distributed computing environments where tasks are 
performed by remote processing devices that are linked through a communications 
network. In a distributed computing environment, program modules may be located in 
both local and remote memory storage devices. 

With reference to Figure 1, an exemplary system for implementing the invention 
includes a general purpose computing device in the form of a conventional personal 
computer 20, including a processing unit 21, a system memory 22, and a system bus 24 
that couples various system components including system memory 22 to processing unit 
21. System bus 23 may be any of several types of bus structures including a memory bus 
or memory controller, a peripheral bus, and a local bus using any of a variety of bus 
■ architectures. System memory 22 includes read only memory (ROM) 24 and random 
' access memory (RAM) 25. A basic input/output system (BIOS) 26, containing the basic - 
routines that help . to transfer information between elements within personal computer 20, 
such as during start-up, is stored in ROM 24. Personal computer 20 further includes a 
hard disk drive 27 for reading from and writing to a hard disk, a magnetic disk drive 28 
for reading from or writing to a removable magnetic disk 29 and an optical disk drive 30 
for reading from or writing to a removable optical disk 31 such as a CD ROM or other 
optical media. Hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are 
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connected to system bus 23 by a hard disk drive interface 32, a magnetic disk drive 
interface 33, and an optical drive interface 34, respectively. The drives and their 
associated computer-readable media provide nonvolatile storage of computer-readable 
instructions, data structures, program modules and other data for personal computer 20. 
Although the exemplary environment described herein employs a hard disk, a removable 
magnetic disk 29 and a removable optical disk 31, it should be appreciated by those 
skilled in the art that other types of computer-readable media which can store data that is 
accessible by computer, such as random access memories (RAMs), read only memories 
(ROMs), and the like may also be used in the exemplary operating environment. 

A number of program modules may be stored on the hard disk, magnetic disk 129, 
optical disk 31, ROM 24 or RAM 25, including an operating. system 35, one or more 
application programs 36, other program modules 37, and program data 38. A database 
system 55 may also be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 
24 or RAM 25. A user may enter commands and information into personal computer 20 
through input devices such as a keyboard 40 and pointing device 42. Other input devices 
may include a microphone, joystick, game pad, satellite dish, scanner, or the like: These 
and other input devices are often connected to processing unit 21 through a serial port 
interface 46 that is coupled to system bus 23, but may be connected by other interfaces, 
such as a parallel port, game port or a universal serial bus (USB). A monitor 47 or other 
type of display device is also connected to system bus 23 via an interface, such as a video 
adapter 48. In addition to the monitor, personal computers typically include other 
peripheral output devices such as speakers and printers. 
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Personal computer 20 may operate in a networked environment using logical 
connections to one or more remote computers, such as a remote computer 49. Remote 
computer 49 may be another personal computer, a server, a router, a network PC, a peer 
device or other common network node, and typically includes many or all of the elements 
described above relative to personal computer 20, although only a memory storage device 
50 has been illustrated in Figure 1. The logical connections depicted in Figure 1 include 
local area network (LAN) 51 and a wide area network (WAN) 52. Such networking 
environments are commonplace in offices, enterprise-wide computer networks, intranets, 
and the Internet. 

When using a LAN networking environment, personal computer 20 is connected 
to local network 51 through a network interface or adapter 53. When used in a WAN 
networking environment, personal computer 20 typically includes a modem 54 or other 
means for establishing communication over wide area network 52, such as the Internet. 
Modem 54, which may be internal or external, is connected to system bus 23 via serial 
port interface 46. In a networked environment, program modules depicted relative to 
personal computer 20, or portions thereof, may be stored in remote memory storage 
device 50. It will be appreciated that the network connections shown are exemplary and 
other means of establishing a communications link between the computers may be used. 

Cost Estimation Using Cardinality Estimates Based on Statistics on Intermediate Tables 

SITs are statistics built over the results of query expressions or intermediate 
tables, and their purpose is to eliminate error propagation through query plan operators. 
For the purposes of this description, a SIT is defined as follows: Let R be a table, A an 



attribute of R, and Q an SQL query that contains R.A in the SELECT clause. $\T(RA\Q) 
is the statistic for attribute A on the result of the executing query expression Q. Q is 
called the generating query expression of SIT (R.A\Q). This definition can be extended 
for multi-attribute statistics. Furthermore, the definition can be used as the basis for 
extending the CREATE STATISTICS statement in SQL where instead of specifying the 
table ; name of the query, more general query expression such as a table valued expression 
can be used. 

In. U.S. Patent Application Serial No. 10/191,822, incorporated herein by 
reference in its entirety, the concept of SITs was introduced. A particular method of 
adapting a prior art query optimizer to access and utilize a preexisting set of SITs for cost 
estimation was described in detail in this application, which method is summarized here 
briefly as background information. 

Referring to Figure 2, the query optimizer examines an input query and generates 
a query execution plan that most efficiently returns the results sought by the query in 
terms of cost. The cost estimation module and its imbedded cardinality estimation 
module can be modified to utilize statistics on query expressions, or intermediate tables v 
to improve the accuracy of cardinality estimates. 

In general, the use of SITs is enabled by implementing a wrapper (shown in . 
phantom in Figure 2) on top of the original cardinality estimation module of the RDBMS. 
During the optimization of a single query, the wrapper will be called many times, once 
for each different query sub-plan enumerated by the optimizer. Each time the query 
optimizer invokes the modified cardinality estimation module with a query plan, this 
input plan is transformed by the wrapper into another one that exploits SITs. The 
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cardinality estimation module uses the input plan to arrive at a potentially more accurate 
cardinality estimation that is returned to the query optimizer. The transformed query plan 
is thus a temporary structure used by the modified cardinality and is not used for query 
execution. 

According to the embodiment described in application serial number 10/191,822, 
the transformed plan that is passed to the cardinality estimation module exploits 
applicable SITs to enable a potentially more accurate cardinality estimate. The original 
cardinality estimation module requires little or no modification to accept the transformed 
plan as input. The transformation of plans is performed efficiently, which is important 
because the transformation will be used for several sub-plans for a single query 
optimization. 

In general, there will be no SIT that matches a given plan exactly. Instead, 
several SITs might be used for to some (perhaps overlapping) portions of the input plan. 
The embodiment described in application serial number 10/191,822 integrates SITs with 
cardinality estimation routines by transforming the input plan into an equivalent one that 
exploits SITs as much as possible. The transformation step is based on a greedy 
procedure that selects which SITs to apply at each iteration, so that the number of 
independence assumptions during the estimation for the transformed query plan is 
minimized. Identifying whether or not a SIT is applicable to a given plan leverages 
materialized view matching techniques as can be seen in the following example. 

In the query shown in Figure 3(a) RxS and Rx T are (skewed) foreign-key 
joins. Only a few tuples in S and T verify predicates a s . a <io(S) and <Jr.b>2o(T) and most 
tuples in R join precisely with these tuples in S and T. In the absence of SITs, 



independence is assumed between all predicates and the selectivity of the original query 
is estimated as the product of individual join and filter selectivity values. This will 
produce a very small number, clearly a gross underestimate of the selectivity value. In 
the presence of the two SITs shown in Figure 3, the two maximal equivalent rewritings 
shown in Figure 3(b) and 3(c) are explored and one of them is selected as the transformed 
query plan. Each alternative exploits one available SIT and therefore takes into 
consideration correlations introduced by one of the skewed joins. Thus, the resulting 
estimations, although not perfect, have considerably better quality than when base-tables 
statistics are used. 

Because the previous example, employed view matching techniques as the main 
engine to guide transformations, no alternative was explored that exploited both SITs 
simultaneously. This is a fundamental constraint that results from relying exclusively on 
materialized view matching to enumerate alternatives. Therefore it is desirable to 
supplement the enumerated alternatives from materialized view matching with 7 additional 
alternatives that leverage multiple SITs simultaneously. This is accomplished by using 
conditional selectivity as a formal framework to reason with selectivity values to identify 
and exploit SITs for cardinality estimation. 
Conditional Selectivity 

The concept of conditional selectivity allows expression of a given selectivity 
value in many different but equivalent ways. This description will focus on conjunctive 
Select Project Join queries, but the methods herein can be extended to handle more 
general queries, . 1 . 
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An arbitrary SPJ query is represented in a canonical form by first forming the 
Cartesian product of the tables referenced in the query, then applying all predicates 
(including joins) to the Cartesian product, and projecting out the desired attributes. Thus, 
an SPJ query is represented as: 

where ag are attributes of Rj x . . . x R n , and p t are predicates over Ri x...x R n (e.g. 
Ri.a<25, or R } .x=R 2 .y). 

Each set of predicates {p,} that is applied to R/ x . x R„ results in the subset of 
tuples that simultaneously verify all Using bag semantics, projections do not change 
the size of the output, and therefore projections are omitted from consideration when 
estimating cardinalities. To estimate the size of the output/or its cardinality, the fraction 
of tuples in Rj x . . . x R n that simultaneously verify all predicates /?/ (i.e. the selectivity of 
all fi) is approximated, and then this fraction is multiplied by \Rj x . . . x R n \ 9 which can be 
obtained by simple lookups over the system catalogs. The use of selectivities to obtain 
cardinalities results in simpler derivations. The classical definition of selectivity is 
extended as follows: 

Let K= {Ri,...,R n } be a set of tables, and P={pi 9 ,..j>j} 9 Q={qi,--.,Pk} be sets of 
predicates oyer R x ^ Rj x . . . x R n . The selectivity of p with respect to <7 q i a . . . a qk (R*), 
denoted Sel R (P\Q) y is defined as the fraction of tuples in cr q} a ... . a qk (R x ) that 
simultaneously verify all predicates in P. Therefore, 

i* 7 * a - a />, K, a - a ** ( R i x - xR njy\ 



Sel£P\Q) = 
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If Q = 0, this reduces to Sel^P), which agrees with the traditional definition of 
selectivity. 

In this description, tables(P) denotes the set of tables referenced by a set of 

predicates P, and attr{P) denotes the set of attributes mentioned in P. To simplify the 

notation, "P,Q" denotes Tu g'Vand "p,Q" denotes " {p} u Q\ where p is a predicate and 

P and Q are sets of predicates. For example, given the following query: 

SELECT * FROM R,S,T 

WHERE R.x-S.y AND S.a<10 and T.b>5 

the selectivity of q, Sel{ Rt s > T}(R.x=S.y,S.a<10 t T.b>5) is the fraction of tuples in RST that 
verify all predicates. Additionally, tables(R:x=S.y;S.a<10) = {R,S}, and 
' aUr(R.x^S.y,S.a<10) = {R.x,S.y,S.a} . 

In general the task is to estimate Seljj?!,. . .pk) for a given query a pI a ... a p * 

(tf). Two properties, atomic decomposition and separable decomposition, are verified by 
conditional selectivity values and allow a given selectivity to be expressed in many 
equivalent ways. Proofs of the properties are omitted. 

Atomic decomposition is based on the notion of conditional probability and 
unfolds a selectivity value as the product of two related selectivity values: 

SelJJ> t Q) = Sel^mySel^q) 

The property of atomic decomposition holds for arbitrary sets of predicates and tables, 
without relying on any assumption, such as independence. By repeatedly applying 
atomic decompositions over an initial selectivity value S, a very large number of 
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alternative rewritings for S can be obtained, which are called decompositions. The 
number of different decompositions of Seljipl,. ..,/?„), denoted by T(n), is bounded as 
follows: 0.5(w+l)!< T(n) <\.5 n n\ forn<l. 

In the presence of exact selectivity information, each possible decomposition of 
Sel^P) results in the same selectivity value (since each decomposition is obtained 

through a series of equalities). In reality, exact information may not be available. 
Instead, a set of SITs is maintained and used to approximate selectivity values. In such 
cases, depending on the available SITs, some decompositions might be more accurate 
than others. To determine which decompositions are more accurate, a measure of how 
accurately S can be approximated using the current set of available SITs is assigned to 
each decomposition S of Sel^P). Then approximating Sel^JP) can be treated as an 

optimization problem in which the "most accurate" decomposition oiSelJJP) for the 
given set of available SITs is sought. 

A naive approach to this problem would explore exhaustively all possible 
decompositions of SelJJP), estimate the accuracy of each decomposition and return the 

most accurate one. To improve on this approach, the notion of separability is used. 
Separability is a syntactic property of conditional selectivity values that can substantially 
reduce the space of decompositions without missing any useful one. It is said that 
SeljP) is separable (with Q possibly empty) if non-empty sets Xj and X 2 can be found 
such that Pu Q = X } \jX 2 and tables{X } ) n tables(X 2 )=0. In that case, Xj and X 2 are said 
to separate Sel^P). \ For example, given P={r.6=5,S.a<10}, Q={R.x=S.y} y and S= 
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Sel { R tS ,T}(P\Q\ X } = {7:6=5} andX 2 = {R.x=S.y,S.a<lO\ separate S. This is because 
tables(JO}={r} and tables (X 2 ) = If S.y = T.z were added to £), the resulting 

selectivity expression is no longer separable. 

Intuitively, Sel^P \Q) is separable if cr PAe (i?^ ) combines some tables by using 

Cartesian products. It is important to note, however, that even if the original query does 
riot use any Cartesian product, after applying atomic decompositions some of its factors 
can become separable. The property of separable decomposition, which is applicable 
where the independence assumption is guaranteed to hold, follows: 

Given that {Pj f P2} and {QlQi} are partitions of P and £>, respectively, and 
, X y =P/u gy an^ / . 

separated into Sel^P^Qj) - Sel^Qi)* For example, {T.t> == 5} and {R.x=S.y, S.a < 

10}can be separated into S = Self Rf s f Tj( T.b = 5,S.a < 10| R.x=S.y) which yields S= Sel( Rf s}( 
R.x=S.y y S.a < 10) • Sel m (T.b =? 5)..,. 

Using the separable decomposition property, it can be assumed that if #is a 
statistic that approximates Sel^P\Q) and Sel^ (P\Q) is separable as Sel^ (Pi\Qi) * Se/^ 
(P2\Qi) then there are two statistics ^ and ^ that approximate (P/|g//and Sel^ 
(P2\Q,2) such that: 1) H and combined require at most as much space as #does, and 
2) the approxiihation using ^ and 7^ is as accurate as that of H. For example, Sel{ RtS }{- 

R.a<\0,S.b > 20), is separable as Sel (R} { R.a < 10) • Sel {S} ( S.a > 20). In this situation, 
using two uni-dimensional histograms H(R.a) and H(S.b) to estimate each factor and then 
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multiplying the resulting selectivity values assuming independence (which is correct in 
this case) will be at least as accurate as using directly a two dimensional histogram 
H(R.a,S.b) built on RxS. In fact, the independence assumption holds in this case, so the 
joint distribution over {R.q,S.b} can be estimated correctly from uni-dimensional 
distributions over R.a and S.b. For that reason, statistics that directly approximate 
separable factors of decompositions do not need to be maintained since such statistics can 
be replaced by more accurate and space-efficient ones. Therefore, all decompositions S = 

5; S n for which some 5/ is separable can be discarded without missing the most 
accurate decompositions. 

The separable decomposition property and the above assumption can substantially 
reduce the search space, since consideration of large subsets of decompositions can be 
avoided. However, in many cases the search space is still very large. To make the 
optimization problem manageable, some restrictions can be imposed on the way the 
accuracy of decomposition is measured. A dynamic-programming algorithm can then 
return the most accurate decomposition for a given selectivity value, provided that the 
function that measures the accuracy of the decompositions is both monotonic and 
algebraic. . 

The error of a decomposition, which measures the accuracy of the available set of 
statistics approximating the decomposition, must verify two properties, monotonicity and 
algebraic aggregation. Given S= Sel /pi,.-.,p n )is a selectivity value and S= Sr-..'Sk is 

a non-separable decomposition of cf such that $ = Sel^Pi\Qi). If statistic Hi is used to 

approximate £, the error(/#,c$) is the level of accuracy of H approximating & The 
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value error(/j£,e5i) is a positive real number, where smaller values represent better 
accuracy. The estimated overall error for S~ Si . . . Sk is given by an aggregate function 
E(ei,. :.,e„), where e t = error(?{ h S). 

E is monotonic if every time that < e', for all /, E(ej,...,e n ) < E(e'j,.. .,e' n ). 
Monotohicity is a reasonable property for aggregate functions representing overall 
accuracy: if each individual error e\ is at least as high as error e h then the overall 
E(e'j 9 . . .,e } n ) would be expected to be at lest as high as E{e h . . .,e n ). 

F is. distributive if there is a function, G such that F(x/,. . . ,x n ) = G(F(x y ,. . .,x„), 
F(x,+/,. . .,x„)). Two examples of distributive aggregates are max (with G-max) and 
count (with G=sum).. In general, E is algebraic if there are distributive functions 
F/,. . .,F m and a function H such that £(xy,. . .,x„) =i/(F/(x/,. . .,x„) v . ,,F w (x/,. . .,x„)). For 
example avg is algebraic with F/=sum, F2=count, and H(x,y)=x/y. For simplicity, for 
an algebraic E, E merge (E(x } ,. . t9 xi} 9 . . . 9 (E(x i+ j ? .. .,"x„)) is defined as £(x ; ,. . .,x„). Therefore, 
avg m e r ge(avg(l,2),avg(3,4))=avg(l,2,3,4). 

Monotonicity imposes the principle of optimality for error values, and allows a 
dynamic programming strategy to find the optimal decomposition oiSel^P). The 
principle of optimality states that the components of a globally optimal solution are 
themselves optimal. Therefore the most accurate decomposition of Sel^P) can be found 

by trying all atomic decompositions Sel^P\Q) = SeljfP f \Q) '* SelJ^Q), recursively 
obtaining the optimal decomposition of Sel^Q), and combining the partial results. In 
turn, the key property of an algebraic aggregate E is that a small fixed-size vector can 



summarize sub-aggregations and therefore the amount of information needed to carry 
over between recursive calls to calculate error values can be bounded. 

Building on the decomposition and error principles discussed above, Figure 4 
illustrates a recursive algorithm "getSelectivity" designated generally as 400 for obtaining 
an accurate approximation of a selectivity value. In general, getSelectivity separates a 
selectivity value into simpler factors and then recursively calls itself to obtain partial 
selectivity values that are then combined to obtain the requested selectivity value. The 
algorithm relies on the error function being monotonic and algebraic, and avoids 
considering decompositions with separable factors. The pruning technique uses the fact 
that there is always a unique decomposition of Sel^(P) into non-separable factors of the 

form Sel^Pt). In other words, given a desired Sel^P), and repeatedly applying the 
separable decomposition property , until no single resulting factor is separable, the same 
non-separable decomposition of Sel^P) will result. 

In step 410, the algorithm considers an input predicate P over a set of tables R. 
The algorithm first checks to see if Sel^P) has already been stored in a memoization 

table indicated as 490. If the value is stored, the algorithm returns that value and the 
process ends. If the value has not yet been stored, the algorithm determines if the input 
selectivity value predicate Sel^P) is separable and if so separates Sel J^P) into i factors 

(step 420). For each factor, getSelectivity is recursively called (Step 460) and the optimal 
decomposition is obtained for each factor. Then, partial results and errors are combined 
in steps 470 and 475 and returned. Otherwise, Sel^P) is not separable and it is passed to 
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steps 430 and 440'where all atomic decompositions SelJP'\Q) • SelJQ) are tried. For 
each alternative decomposition, SelJQ) is recursively passed to getSelectivity (Step 460). 
Additionally, in step 450 SelJP'\Q) is approximated using the best available SITs 

among the set of available statistics 455. If no single statistic is available in step 450 the 
error P \q is set to oo and another atomic decomposition of the factor is considered. . After 
all atomic decompositions are explored in steps 440 and 450 the most accurate estimation 
for SelJP) (493) and its associated error is calculated in steps 470 and 475 and returned 
(and stored in the table 490). As a byproduct of getSelectivity, the most accurate 
. .selectivity estimation for every sub-query a P {R*) with P'cP is obtained. It can be 

shown that get Selectivity (R,P). returns the most accurate approximation of SelJP) for a 

given definition of error among all non-separable decompositions. A pseudo code 
implementation of getSelectivity follows: 

getSelectivity (ft:tables, P:predicates oyer ft*) 

Returns (SelJP), error/>) such that error/* is best among all non-separable decompositions 



01 if (SelJP)) was already calculated) 

02 ( SelJP)), errors) = memoization_table_lookup(P) 

03 . else if SelJJ?) is separable 

04 get the standard decomposition of SelJJ*): 
* ' SelJP)- SelrfPj) • .. t . - SeljPn) 

05 v . (Spi, errorp/) = getSelectivity(^yP,) (for each i=l..n) 

06 .Sp~Spj...S pn 

07 error/>=£ mgrgc (errorpy,,.-., errors) 



08 else // Sel^P) is non-separable 

09 errorp ?= oo ; bestK = NULL 

10 foreachP , £ P,Q = P-P' 

// check atomic decomposition Sel£P f \Q)) • Sel^Q)) 

11 (5eerror e ) = getSelectivity(K,Q) 

12 (H, errorp|^) = best statistic (along with the estimated error) to 
approximate Sel^P '\Q)) \ 

13 ' if .(^ m ^(errorp- lej error e ) < errorp) 

14 . . S p = E mer g e (evvovp '\ Q , error 0 ) 

15 bestK=K 

16 Sp \Q = estimation of Sel^P'\Q) using bestH 

17 : S p ,= S p -\q;Sq 

18 memoization_table_insert(P, 5^, errors) 

19 return (Sp, errorp) 

The worst-case complexity of getSelectivity is (9(3*), where n is the number of 

input predicates. In fact, the number of different invocations of getSelectivity is at most 
2 n , on for each subset of P. Due to memoization, only the first invocation to 
v getSelectivity for each subset of P actually produces some work (the others are simple 
lookups). The running time of getSelectivity for k input predicates (not counting 
recursive calls) is G(k 2 ) for separable factors and (9(2*) for non-separable factors. 

Therefore the complexity of getSelectivity is 0( Y" {) • 2 k ), or (9(3"). In turn, the 
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space complexity of get Selectivity is (9(2 n ) to store in the memoization table selectivity 
and error values for Sel'Jj?) with p^P. 

The worst-case complexity of getSelectivity,- (9(3"), can be contrasted with the 
lower bound of possible decompositions of a predicate, O ((w+l)!). Since («+l)!/3" is 

Q (2"), by using monotonic error functions the number of decompositions that are 
explored is decreased exponentially without missing the most accurate one. If many 
subsets of P are separable, the complexity of getSelectivity is further reduced, since 
smaller problems are solved independently. For instance if SelJ^P) = Sel^(Pj) • 

Sel^jJPi), where \Pi\=kj and \P 2 \=k 2 , the worst case running time of getSelectivity is 

0(3 kJ +3 k2 \ which is much smaller than 0(3 k!+k2 ). 

In step 450, getSelectivity obtains the statistic H to approximate Sel^P\Q) that 
minimizes error(K, Sel^(P\Q)). This procedure consists of l)obtaining the set of 
candidate statistics that can approximate SelJj?\Q) and 2)selecting from the candidate set 
the statistic H that minimizes error(K, Sel^p \Q)). 

In general, a statistic H consists of a set of SITs For simplicity the notation is 
modified to represent SITs as follows. Given query expression q = <5 pJ A... a^CR x ), 
SIT K (ay,. . .,a,|p/,. . .j>k) will be used instead of SIT(a y ,. . .,aj\q). That is, the set of 
predicates of q over K K is enumerated, which agrees with the notation for selectivity 
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values. It should be noted that for the purposes of this discussion SITs will be described 
as histograms, but the general ideas can be applied to other statistical estimators as well 
Therefore H R (a/,. . .,a 7 -|p/,, . .,pk) is a multidimensional histogram over attributes a/,. . ay- 
built on the result of executing a p i a ■ . . . a p aCR x ). As a special case, if there are no 
predicates p i9 K K (a/,. . .,a y |) is written, which is a traditional base-table histogram. The 
notion of predicate independence is used to define the set of candidate statistics to 
consider for approximating a given Sel£P\Q). ; 

Given sets of predicates P/, /^, and Q it is said that Pj and P 2 are independent 
with respect to Q if the following equality holds Sel^(P I ,P 2 \Q) = SelJ,P } \Q) ■ Sel H (P 2 \Q) 
where Rj^tdbles(P/ y Q) mdR 2 = tables(P 2 ,0. If-JP/ and P 2 are independent with respect to 
Q, then Sel K (Pi\P 2t Q)- Sel^(Pj\Q) holds as well. If there is no available statistic 

approximating Sel £p\Q), but there is an available statistic H approximating SelJjy\Q 
independence between P and Q' is assumed with respect to Q-Q' and ft is used to' 
approximate SelJj?\Q). This idea is used to define a candidate set of statistics to ; 
approximate Sel£P\Q). "* 

Given that 5= Sel£P\Q) where P is a set of filter predicates, such as {R.a<5 

S.b>8}, the candidate statistics to approximate 5 are all t {H(A\Q'))} that simultaneously 

verify the following three properties. 1) attr{P) c A (the SIT can estimate the 
predicates). 2) Q\g: Q (assuming independence between P and Q-Q*). In a traditional 



* optimizer, Q'=0, so P and Q are always assumed independent. 3) Q is maximal, i.e., 
there is no H K (A\Q") available such that (7c Q *c Q. 

In principle, the set of candidate statistics can be defined in a more flexible way, 
e.g., including statistics of the form H K (A\Q'), where Q' subsumes Q. The candidate sets 

of statistics are restricted as described above, to provide a good tradeoff between the 
efficiency to identify them and the quality of the resulting approximations. For example 
given 5- Sel^(R.a<5\p h p 2 ) and the statistics H K {R.a\p!), H K (R.a\p 2 ), H R (R.a\p If p 2 ,p 3 ), 

and H R (R.a), the set of candidate statistics for S include {H R (R.a\pj)}and {H R (R.a\p 2 )}. 

H R (R.a) does not qualify since its query expression is not maximal; and H R (R.a\p It p 2 ,p 3 \ 
does not qualify since it contains an extra predicate/??. • 

In many cases a predicate P is composed of both filter and join predicates, e.g., 
P={T.a < 10, R.x=S.y, S.b > 5} . To find Sel£P\Q) in this case severaLobservations about 

histograms are used. If Hi = H R {xJ(\Q) and H 2 = H R (y,Y\Q) and both are SITs, the join 

Hi>< x =y H 2 returns not only the value Selpc-y\Q) for the join, but also a new 

histogram Hj = H R (x,X 9 Y\x=y 9 Q). Therefore Hj can be used to estimate the remaining 

predicates involving attributes x(=y), X, and Y. As an example, to find Sel^R.a < 5, 

R.x=S.y\Q) given histograms Hj = H R i(R.x y R.a\Q) and H2 = H R 2(S.y\Q% the join 

Hi>< R . x =s.y H 2 returns the scalar selectivity value sj= Sel ^(R.x=S.y \Q) and also H3 = 

HJ,R.a\R.x=S.y,Q). The selectivity of Sel £R.x=S.y,R. a < J|0 is then conceptually 



obtained by the following atomic decomposition: s r s 2 = = Sel^R.a < 5\R.x=S.y,Q) • 
Sel ^(R.x=S.y\Q\ where s 2 is estimated using H 3 - 

As the example shows, Sel£P\Q) can be approximated by getting SITs covering 
all attributes in P, joining the SITs, and estimating the remaining range predicates in P. 
In general, the set of candidate statistics to approximate Sel£P\Q) is conceptually 

obtained as follows: 1) All join predicates in P are transformed to pairs of wildcard 
selection predicates P\ For instance, predicate R.x=S.y is replaced by the pair (R.x = ?, 
S.y=?X and therefore Sel^R.x =S.y, T.a<10, S.b>S\Q) results in Sel£R.x=?, S.y=?> 

T.a<10, S.b>5\Q). 2) Because the join predicates in P were replaced with filter 
predicates in P* above, the resulting selectivity value becomes separable. Applying the 
separable decomposition property yields Sel n fP 1 i\Qi)- . . r Sel ^P ' *|Qt), where no 

Sel^P 'i\Qi) is separable. 3) Each Sel^P \\Q) contains only filter predicates in P ' h so 
each candidate set of statistics can be found independently. In order to approximate the 
originaKselectivity value with the candidate set of statistics obtained in this way, all Hi 

are joined by the attributes mentioned in the wildcard predicates and the actual range of 
predicates is estimated as in the previous example. 

Once the set of candidate statistics is obtained to approximate a given Sel^(P\Q) y 
the one that is expected to result in the most accurate estimation for SelJJP\Q) must be 
selected, i.e., the statistic H that minimizes the value of error{H, SelJJP\Q)). 



22 



In getSelectivity, error(H,S) returns the estimated level of accuracy of 
approximating selectivity 5 using statistic H. There are two requirements for the 
implementation of error(U, 5). First, it must be efficient, since error(H,S) is called in the 

inner loop of getSelectivity. Very accurate but inefficient error functions are not useful, 
since the overall optimization time would increase and therefore exploiting SITs becomes 
a less attractive alternative. For instance, this requirement bans a technique that looks at 
the actual data tuples to obtain exact error values. 

The second requirement concerns the availability of information to calculate error 
values. At first sight, it is tempting to reformulate error as a meta-estimation technique. 
Then, in order to estimate the error between two data distributions (actual selectivity 
values 5 vs. SIT-approximated selectivity values H) additional statistics, or meta- 
statistics could be maintained over the difference of such distributions. Therefore, 
estimating error(H,S) would be equivalent to approximate range queries over these meta- 

statistics. However, this approach is flawed, since if such meta-statistics existed, they 
could be combined with the original statistic to obtain more accurate results in the first 
place. As an example given H = H R (R.a\pj), approximating S = Sel K (R.a<10\pj,p2). If a 

meta-statistic M is available to estimate values error(H Sel^ci<R,a<c 2 \phP2)), H and 

M can be combined to obtain a new statistic that directly approximates 

Sel£R.a<10\p h p2). 
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Therefore error values must be estimated using efficient and coarse mechanisms. 
Existing information such as system catalogs or characteristics of the input query can be 
used but not additional information created specifically for such purpose. 

Application serial number 10/191,822 introduced an error function, nlnd, that is 
simple and intuitive, and uses the fact that the independence assumption is the main 
source of errors during selectivity estimation. The overall error of a decomposition is 
defined as 5 = Sel {Pj\Qj) •. . .• Sel £P n \Q n ) when approximated, respectively, using 

f£ (A/IQ 9 /),. . H^A n \Q\) (Q ') e Q,), as the total number of predicate independence 

assumptions during the approximation, normalized by the maximum number of 
independence assumptions in the decomposition (to get a value between 0 and 1). In 
symbols, this error function is as follows: 



Each term in the numerator represents the fact that Pi and Q t - Q ) are independent 
with respect to Q h and therefore the number of predicate independent assumptions is 
I P l \ ' I Qi 1 Q V|- In turn, each term in the denominator represents the maximum number 
of independence assumptions when Q ) = 0, i.e \Pi\-\ Qi\. As a very simple example, 
consider 5 = Sel R (R.a<10,R.b>50) and decomposition S = Sel R (R.a<10\R.b>50) • 
Sel R (R,b>50). If base table histograms H(R. a) and H(R.b) are used, the error using nlnd 

. l-(l-0) + l (0-0) 171 ' . \ c ■ A a + - /u . 

is — ^ ■ — — -^=1/1 = 1, i.e., one out of one independence assumptions (between 
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R.a < 10 and R.b> 50). nlnd is clearly a syntactic definition which can be. computed very 
efficiently. 

While nlnd is a very simple metric, often many alternative SITs are given the 
same score, and nlnd needs to break ties arbitrarily. This behavior is problematic when 
there are two or more available SITs to approximate a selectivity value, and while they all 
result in the same "syntactic" nlnd score, the actual benefit of using each of them is 
drastically different, as illustrated in the following example. 

• Consider Rt><R. s =s.s (&s:a<ioS) ><s.t=T.tT, with both foreign-key joins, and the 
following factor of. a decomposition: 5/ = Sel^j$.a<lO\RxS, Sx T). If the only 

candidate SITs to approximate Sj are H } ~ H {RtS }(S.a\RxS) and H 2 - H( RS }(Sm\S><T), 
using the error function nlnd, each statistic would have a score of 1/2, meaning that in 
general each alternative will be chosen at random 50% of the time. However, in this 
particular case Hj will always be more helpful than H 2 . In fact, since Sx S j=tjT is a 
foreign key join, the distribution of attribute S.a over the result of Sxs.t^r.tT is exactly 
the same as the distribution of S. a over base table S. Therefore, Sx s .t=T.tT is actually 
independent of S. a< 1 0 and H 2 provides no benefit over the base histogram H(S. a). 

An alternative error function. Biff, is defined as follows. A single Value, diff H , 
between 0 and 1 is assigned to each available SIT H = H{R.a\Q). In particular, diff H -Q 
when the distribution of R.a on the base table R is exactly the same as that on the result of 
executing query expression Q. On the other hand, diffn = 1 when such distributions are 
very different (note that in general there are multiple possible distributions for which diff H 
= 1, but only one for which diff H = 0). Using devalues, the Diff error function 



generalizes nlnd by providing a less syntactic notion of independence. In particular, the 
overall error value of a decomposition S- Sel(Pj\Qi) Sel K £P„\Q„) when 
approximated using Hi,. , ,H„, respectively is given by: 



X \P, V\Qi \-\\~diff H 
.DmSel(P i \Q l ),H i })=~ i 



.E^i-iai 

. The intuition behind the expression above is that the value \Q\ • {\-diffni) in the 
numerator represents a "semantic" number of independence assumptions when 
approximating Si with H iy and replaces the syntactic value | Q t - Q\ | of nlnd. In fact in 
the previous example, diffui == 0, and Hj effectively contributes the same as a base-table 
histogram H(S.a), so in that case the error function is 1 (the maximum possible value). In 
contrast, for H 2 =H(S.a\R ><S) y the more different the distributions of S.a on S and on the 
result of executing R\>< S, the more likely that Hi encodes the dependencies between S.a 
and^cxS', SxT} , and therefore the lower the overall error value. 

For H ~ H K (a\Q) with K' denoted as <j q (R) (the result of evaluating Q over R), 
diffn can be defined as: ..' , 

xedom(a) 1^1- 1^1 

whereX^) is the frequency of value x in R (diff H is the squared deviation of 
frequencies between the base table distribution and the result of executing fTs query 
expression). It can be shown that 0 < diff H < 1 , and that diff H verifies the properties 
stated above. Values of diff are calculated just once and are stored with each histogram, 



so there is no overhead at runtime, diffu can be calculated when H R(a \Q) is created, 

but that might impose a certain overhead to the query processor to get the/(i?, a). Instead, 
diffii is approximated by carefully manipulating both H and the corresponding base-table 
histogram (which, if it does not exist, can be efficiently obtained using sampling). The 
procedure is similar to calculating the join of two histograms. 

' In essence, Diff 'is a heuristic ranking function and has some natural limitations. 
For instance, it uses a single number (Diff) to summarize the amount of divergence 
between distributions, and it does not take into account possible cancellation of errors 
among predicates. However, the additional information used by Diff makes it more 
robust and accurate than nlnd with almost no overhead. . 

Referring again to the query of Figure 3, the value of Sel{Rs t T}(<5 a , >< rs, 
>< R1 ) is to be estimated where o a and ^represent the filter predicates oyer S.a and T.b, 
respectively, and expand t>< rt represent the foreign key join predicates. Using nlnd 
for errors, getSelectivity returns the decomposition; 5/ = Sel R> sj(^a\^b, ><rs, >« rt) ■ 

Sel RiSi7 ic b \ ><\rs, ><rt) -Sel R xi( ><'rs\><rt) • Sel Rr7 (><RT) using respectively, 
statistics H R ,s(S.a\ >< RS ) H KJ (T;b\>< RT ), {Hr(R.s\ Hs(S.s)} , and {H^RJ), H T (T.t)} . 
Therefore, both available SITs are exploited simultaneously, producing a much more 
accurate cardinality estimate for the original query than any alternative produced by 
previous techniques. 

Integration with an Optimizer 
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The algorithm getSelectivity can be integrated with rule based optimizers. For q 
= o P ] a . . . A^(fc x ), getSelectivity (ft, {pj,. . .,/?*}) returns the most accurate selectivity 

estimation for both q and all its sub-queries, i.e., Sel (P) for all P c {/?/,. ..,/?*} • A. simple 

approach to incorporate getSelectivity into an existing rule-based optimizer is to execute 
getSelectivity before optimization starts, and then use the resulting memoization table to 
answer selectivity requests over arbitrary sub-queries. This approach follows a pattern 
similar to those used by prior art frameworks to enumerate candidate sub-plans, in which 
a first step generates exhaustively all possible equivalent expressions and then, in a 
second phase, the actual search and pruning is performed. It was later established that 
this separation is not useful, since only a fraction of the candidate sub-plans generated . 
during exploration is actually considered during optimization. Instead, newer 
frameworks interleave an exploration by demand strategy with the search and pruning 
phase. 

Cascades is a state-of-the-art rule-based optimization framework. During the 
optimization of an input query, a Cascades based optimizer keeps track of many 
alternative sub-plans that could be used to evaluate the query. Sub-plans are grouped 
together into equivalence classes, and each equivalence class is stored as a separate node 
in a memoization table (also called memo). Thus, each node in the memo contains a list 
of entries representing the logically equivalent alternatives explored so far. Each entry 
has the form [op, {input 7,. . .Jnput n } y {parameter 7,. parameter^], where op is a logical 
operator, such as j oin, input \ is a pointer to some other node (another class of equivalent 
sub-queries), and parameter j is a parameter for operator op. 
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. Figure 5 illustrates a memo that corresponds to an intermediate state while 
optimizing the query SELECT * FROM R, S WHERE R.x=s.Y AND R.a<10 AND 
S.b>5 . The node at the top of the figure groups together all query plans equivalent to 
(0R. a <io(i?)) >< R,x=s.y (<?s.b>5(S)) that were already explored. The first entry in such node 
is [SELECT, {R >< R . x = S y (as.&> 5 (S))}, {R.a<10}], that is, a filter operator, with 
parameter R.a<\Q, applied to the node that groups all equivalent expressions for sub- 
query R >< R. x ^s.y (cfr6>5(S)). Analogously, the second entry corresponds to a join 
operator applied to two other nodes. • 

During optimization, each node in the memo is populated by applying 
transformation rules to the set of explored alternatives. Rules consist of antecedent and 
consequent patterns, and optional applicability conditions. The application of a given 
transformation rule is a complex procedure that involves: (i) finding all bindings in the 
memo, (ii) evaluating rule preconditions, (iii) firing the rule, i.e., replacing the antecedent 
pattern with the consequent pattern, and (iv) integrating the resulting expression (if it is 
new) to the memo table. As -a simple example, the first entry in the node at the top of 
Figure 5 could have been obtained from the second entry by applying the, following 
transformation rule: [Ti] >< (xjp{T 2 ))=> o/>(T 7 >< T 2 ) which pulls put selections. above 
join predicates (Tj and T 2 function as placeholders for arbitrary sub-queries). 

The algorithm getSelectivity can be integrated with a Cascades based optimizer. 
If the optimizer restricts the set of available statistics, e.g., handles only uni-dimensional 
SITs, then getSelectivity can be implemented more efficiently without missing the most 
- accurate decomposition. For uni-dimensional SITs, it can be shown that no atomic 



decomposition Sel^F) = Sel£P'\Q) • SelJQ) with > 1 will have a non-empty 

candidate set of statistics, and therefore be useful. In this case, line 10 in getSelectivity 
can be changed to: 

10 for each P'cP,Q = P-P' such that |P'| < 1 do 

without missing any decomposition. Using this optimization, the complexity of 
getSelectivity is reduced from 0(3 n ) to O^C) • i = 0(n- 2 n l ), and the most accurate 

selectivity estimations will be returned. As a side note, this is the same reduction in 
complexity as obtained when linear join trees during optimization as opposed to bushy 
join trees. 

The search space of decompositions can be further pruned so that getSelectivity 
can be integrated with a cascades based optimizer by coupling its execution with the 
optimizer's own search strategy. This pruning technique is then guided by the 
optimizer's own heuristics, and therefore might prevent getSelectivity from finding the 
most accurate estimation for some selectivity values. However, the advantage is that the 
overhead imposed to an existing optimizer is very small and the overall increase in 
quality can be substantial. 

As explained, for an input SPJ query, # = g pJ a... a^CR*), each node in the 

memoization table of a Cascades based optimizer groups all alternative representations of 
a sub-query of q. Therefore the estimated selectivity of the sub-query represented by n, 
i.e., Sel (P) for Pe {py,. . :,/^}can be associated with each node n in the memo. Each 
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entry in n can be associated to a particular decomposition of the sub-query represented by 
n. 

The node at the top of Figure 5, which represents all equivalent, representations of 
(<Wa</o(R)) t>< R. x =s. y (os.^>5(S)).- The second entry in such node (the join operator) can 
be associated with the following decomposition: Sel{ Rt s}(R.s=S.y | R.a < 10, S.b > 5) • 
Sel{R t s}(R.a < 10, S.b > 5). The first factor of this decomposition is approximated using 
available statistics as already explained. In turn, the second factor is separable and can be 
simplified as Sel{ R j(R.a< 10) - Self S }(S.b > 5). The estimated selectivity of each factor of 
. the separable decomposition is obtained by looking in the corresponding memo nodes 
(the inputs of the join entry being processed). Finally the estimations are multiplied 
together and then by the first factor of the atomic decomposition Self^sjiR-^S.y \R.a< 
10, S.b > 5) to obtain a new estimation for Sel{R r s}(R.s=S.y, R.a < 10, S.b > 5). 

Each entry £ in a memo node n divides the set of predicates P that are represented 
by n into two groups: (i) the parameters of £, that are denoted p c , and (ii) the predicates 
in the set of inputs to £, denoted Q t = P-p c . The entry I in is then associated with the 
decomposition Sel K (P) = SelJ? c \Q c ) Sel n (Qt) where each Sel K (Q t ) is separable into 
Sel (Q l y. . . - ^JQ^), where each Sel^Q^) is associated with the /-th input of I 

In summary, the set of decompositions is restricted in line 10 of getSelectivity to 
exactly those induced by the optimization search strategy. Each time we apply a 
transformation rule thai results in a new entry i in the node associated with Sel^P) to 

obtain the decomposition S - Sel (P^ |Q £ ) Sel n (Q c ). If 5 has the smallest error found so 



far for the current node Sel (P} is updated using the new approximation. Therefore, the 

overhead imposed to a traditional Cascades based optimizer by incorporating 
getSelectivity results from getting, for each new entry £, the most accurate statistic that 
approximates Sel^P t |Q C ). ..' 

So far in this description all input queries have been conjunctive SPJ queries. 
Disjunctive SPJ queries can also be handled by the discussed techniques. For that 
purpose, the identity b pI v p2 (K K ) = K K - a-n p/ A ^ p2 (K K )) is used, and the disjunctive 

query is translated using a de Morgan transformation to selectivity values as Sel K (p\ v P2) 
= 1- &/ R (-ipi , -1P2). The algorithm then proceeds as before with the equality above used 
whenever applicable". For example, decomposition Sel{R t sj}{R<a < 5 v (S.b > IOaT.c^S)) 
is rewritten as 1 - Sel{ RrS j}{R.a > 5, (S b< 10yr.c^5)). The second term is separable and 
is simplified \o Sel{ R }(R.q > 5)- Sel{ S j}(S.b < lOv^.c^S)). The second factor can be , 
transformed again to 1- Sel{ S jj(S.b > 10,r.c=5) which is again separable, and so on. 

The techniques discusses can also be extended to handle SPJ queries with Group- 
By clauses. In the following query. 

SELECT bi,. . .,b n 
'■ FROM R,,...,R n 1 
WHERE pi AND:.. AND pj , 
GROUP BY ai,...,a k 

each bi is either included in {a lr . . .,a k ) or is an aggregate over columns of ft*. The. 
cardinality of q is equal to the number of groups in the output, i.e., the number of distinct 



values (ai,...,a k ) in g p i a ... a p k(R K ), and is obtained by multiplying | ■R^lbythe 
selectivity of the query below: 

SELECT DISTINCT a,,..., a k 

FROM R . R„ 

WHERE pi,...,pj , ... i 

, Thus, to approximate selectivity values of SPJ queries with Group By clauses, the 
selectivity values for SPJ queries must be estimated with set semantics, i.e., taking into 
account duplicate values. The definition of conditional 1 selectivity can be extended to 
handle distinct values as described next. If P is a set of predicates and A is a set of 
attributes, tables(PI A ) is defined as the set of tables referenced either in A or in P, and 
attr(P/ A ) is defined as the attributes either in A or referenced in P. K = {fti,. . .,Rn} is a 

set of tables and P and Q are sets of predicates oyer < R x/ = K\ x . . . x 7^. A and B are sets 
of attributes over K such that attr{PI A ) e 5. The definition of conditional selectivity is 
extended as: 

' SelR(P/,|e/B) = |^^a/<%*(ae(^ x )))|/|7r 5 *^ 

where 7Ta*CR) is a version of the projection operator that eliminates duplicate 

values. 

. The notation of 5 = Sel (PIa\QIb) is simplified, if possible, as follows. If B 
contains all attributes in ft, / 5 is omitted from 5. Similarly, if A = B then l A is omitted 



from 5. Finally, if B contains all attributes in ft and Q = <j> , the selectivity is rewritten as 
5 = SeliPl^), The Value SeliPU) is then the number of distinct A values for tuples in 
GpCR*) divided by |; ft* |. Therefore, for a generic SPJ query with a group-by clause, the 
quantity Sel^ {p h . . .,/?,•/ a} ak ) is to be estimated. 

The atomic decomposition definition can be extended as follows, ft is a set of 

tables, P is a set of predicates oyer ft, and ^ is a set of attributes in ft. Then: 

• .. - i - * ' 

SelJJPI A ) = Sel K (P,/ a \P2/b) • Sel^P 2 l B ) where F/ and ^ partition P and attr(¥,l A ) c 5. 

This generalized atomic decomposition can be integrated with a rule-based 
optimizer that implements coalescing grouping transformations for queries with group-by 
clauses. Coalescing grouping is an example of push-down transformations, which 
typically allow the optimizer to perform early aggregation. In general, such 
transformations increase the space of alternative execution plans that are considered 
during optimization. The coalescing grouping transformation shown in Figure 6 is s 
associated with the following instance of the atomic decomposition property Sel (></a) 

= Set {6Ia\><Ib) •SeLix'/a). For the general case, the >< in the equality is replaced 
with the corresponding set of predicates. 

For SPJ queries the atomic and separable decompositions can be used alone to 
cover all transformations in a rule-based optimizer. In general, the situation is more 
complex for queries with group-by clauses. The separable decomposition property can 
be extended similarly as for the atomic property. In some cases rule-based 
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transformations require the operators to satisfy some semantic properties such as the 
invariant grouping transformation shown in Figure 7 that requires that the join predicate 
be defined over a foreign key of Rj and the primary key of R2. In this case, specific 
decompositions must be introduced that mimic such transformations. Using atomic 
decomposition it is obtained that Sel K (x/ A ) = Sel R (>< | (f> l A ) • Sel^(j)l A ): However, if 

the invariant group transformation SelJ^xl A ) - Sel n (x) • Sel R (<f>/ A ) can be applied as 

well. For that reason the Sel n (x/ A ) = Sel K (><) • Sel n (<f>/ A ) is used which can be easily 

integrated with a rule-based optimizer. This transformation is not valid for arbitrary 
values of P and A 9 but instead holds whenever the invariant grouping transformation can 
be applied. 

In the context of SITs as histograms, traditional histogram techniques can be 
exploited provided that they record not only the frequency but also the number of distinct 
values per bucket. Referring again to Figure 6, Sel^xl A ) = Sef n ((f) I A \ x/b) 

•Sel n (x/s). H K (A\xJ B ) can be used to approximate Sel^(j)l A \ x/b)* In general, to 

approximate Sel n (<p/ A \ Q/b), some candidate SITs of the form HJ,A\Q f l B ) where Q c Q 

are used. 

SITs can be further extended to handle queries with complex filter conditions as 
well as queries with having clauses. The following query asks for orders that were 
shipped no more than 5 days after they were placed. SELECT * FROM orders 
WHERE ship-date - place-date < 5. A good cardinality estimation of the query 
cannot be obtained by just using uni-dimensional base-table histograms over columns 
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ship-date and place -date. The reason is that single-column histograms fail to 
model the strong correlation that exists between the ship and place dates of any particular 
order. A multidimensional histogram over both ship-date = place -date might 
help in this case, but only marginally. In fact, most of the tuples in the two-dimensional 
spaceship-date x place-date will be very close to the diagonal ship-date = 
place-date because most orders are usually shipped a few days after they are placed. 
Therefore, most of the tuples in orders will be clustered in very small sub-regions of 
the rectangular histogram buckets. The uniformity assumption inside buckets would then 
be largely inaccurate and result in estimations that are much smaller than the actual 
cardinality values. 

The scope of SIT(A|Q) can be extended to obtain a better cardinality estimate for 
queries with complex filter expressions. Specifically, A is allowed to be a column 
expression over Q. A column expression over Q is a function that takes as inputs other 
columns accessible in the SELECT clause of Q and returns a scalar value. For instance, a 
SIT that can be used to accurately estimate the cardinality of the query over is H = 
SIT(dif f -date|0 where the generating query Q is defined as: SELECT ship -date 
- p lace- date as diff- date FROM orders. In fact, each bucket in //with range 
[xl- ■ -Xr] and frequency / specifies that / orders were shipped between x L and x R days after 
they were placed. Thus, the cardinality of the query above can be estimated accurately 
with a range query (- oo ? ..,5] oyer H. 

This idea can also be used to specify SITs that help estimating the cardinality of 
queries with group-by and having clauses. The following query: 
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SELECT A, sum (C) 
FROM R 
GROUP BY A 
HAVING avg(B)<10 

conceptually groups all tuples in R by their A values, then estimates the average value of 
B in each group, and finally reports only the groups with an average value smaller than 
10. The cardinality of this query can be estimated using H 2 = SIT(avgB|g2), where the 
generating query Q 2 is defined as: 

SELECT avg(B) as avgB 
FROM R 

GROUP BY A ; 
in this case, Hi is a histogram in which each bucket with range [x L : . ,x R ] and frequency / 
specifies that / groups of tuples from R (grouped by A values) have an average value of B 
between x L and x R . Therefore, the cardinality of the original query above can be 
. estimated with a range query, with range [-00.. . 1 0] , over H2. 

It can be seen from the foregoing description that using conditional selectivity as 
a framework for manipulating query plans to leverage statistical information on 
intermediate query results can result in more efficient query plans. Although the present 
invention has been described with a degree of particularity, it is the intent that the 
invention include all modifications and alterations from the disclosed design falling 
within the spirit or scope of the appended claims. 
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